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ANALYSIS OF A DATA MATRIX AND A GRAPH: 
METAGENOMIC DATA AND THE PHYLOGENETIC TREE 

By Elizabeth Purdom^ 

University of California, Berkeley 

In biological experiments researchers often have information in 
the form of a graph that supplements observed numerical data. In- 
corporating the knowledge contained in these graphs into an analysis 
of the numerical data is an important and nontrivial task. We look 
at the example of metagenomic data — data from a genomic survey 
of the abundance of different species of bacteria in a sample. Here, 
the graph of interest is a phylogenetic tree depicting the interspecies 
relationships among the bacteria species. We illustrate that analy- 
sis of the data in a nonstandard inner-product space effectively uses 
this additional graphical information and produces more meaningful 
results. 

1. Introduction. Relationships among either observations or variables 
are often conveniently summarized by a graph. Incorporating this outside 
information into the analysis of numerical data is of increasing interest, par- 
ticularly in biology where many known properties of genes and proteins are 
described by complicated networks. A common situation is to have numeri- 
cal data from an experiment which is of primary interest and also additional 
knowledge in the form of a graph relating our observations or variables from 
the experiment. We would like to incorporate the information in the graph 
with our analysis of the experimental data. By including the graphical infor- 
mation directly in our analysis, we constrain the space of possible solutions 
to those that are relevant from the point of view of the known information. 

The specific type of graph which we consider here is a phylogenetic tree. 
A phylogenetic tree is a ubiquitous graph in biology that describes the evolu- 
tionary relationship between a set of species. We are motivated to consider 
this graph by our work with Eckburg et al. (2005) analyzing differences 
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in bacterial composition based on a genomic inventory of different samples. 
Such "metagenomic" studies are a popular technique for measuring bacterial 
content. As we argue below, using the phylogenetic information regarding 
the discovered bacteria is key in creating a meaningful analysis — particularly 
because of the small sample size relative to the number of bacteria found. 

There are numerous different strategies for using graphical information, 
such as Bayesian networks and differential equation modeling; they require 
varying degrees of specificity in the graphical information. We focus here 
on a technique that is simple to implement and uses the graph to define 
a nonstandard inner-product space in MP to perform the analysis of the 
numerical data. 

The layout of the paper is as follows. First we will introduce the moti- 
vating example of bacterial composition in more detail and will return to 
the example at the end to demonstrate the techniques on the bacterial data. 
We review how PCA can be succinctly reformulated for nonstandard inner- 
products and its development for ecological studies of species abundance, 
a reformulation we will call generalized PCA (gPCA). The rest of the paper 
delves further into the implications of incorporating outside graphical infor- 
mation through the use of such a metric space. In particular, we give an 
appropriate metric for a phylogenetic tree and evaluate the implications of 
that choice in the final data analysis. Throughout, we focus on the example 
of the phylogenetic tree and metagenomic data to illustrate the concepts. 
However, the same basic approach can be useful in including nonstandard 
forms of knowledge — other types of graphical information in particular. 

Notation. In all that follows, we will use boldface type to indicate vectors 
and matrices and parenthetical subscripts to indicate elements of vectors 
and matrices. Therefore, the jth component of a vector Xj will be given 
as Xj(j) and the i,j element of a matrix A will be given as ^(ij)- 

2. Motivating example. In Eckburg et al. (2005) the broad goal was to 
describe the kinds of bacteria found in the intestinal tract and compare the 
bacterial communities found in different people. To that end, each of the 
three patients in the study had biopsies taken at six locations in his/her 
colon in addition to providing a stool sample. Each of these seven samples 
(per patient) was then subjected to genomic techniques to try to quantify 
the different types of bacteria as well as their abundance. 

Traditional techniques for identifying bacteria require growing the bacte- 
ria in a culture and then classifying the bacteria as a species based on any 
observable characteristics as well as the nutrients needed for it to grow. This 
gives only limited ability to assess the presence of different types of bacteria. 
The increased ease of DNA sequencing has led researchers to classify bacte- 
ria by genomic information ( "metagenomics" ) . We focus here on the results 
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of sequencing a specific gene (16S rDNA) found in bacteria. A random se- 
lection of all the copies of the gene present in the sample are sequenced. 
Ideally, each version of the gene could be uniquely identified as coming from 
a specific bacteria and the abundance of the different gene versions would 
give an estimate of the abundance of each bacteria. In reality, we do not 
have a direct link between a gene version and its originating bacteria, but 
only an estimate of it, as we explain more fully below. 

Bacteria species also share an evolutionary history which might affect 
their biological role in the sample. We summarize the evolutionary relation- 
ship by a phylogenetic tree that describes the evolutionary history of the 
bacterial species. We visualize both the phylogenetic tree relating the bac- 
terial species and their numerical abundance in Figure 1. There is a great 
deal of sparsity in the data; many species are present in low numbers and 
in only a few samples. At the same time, there are some highly abundant 
species found at high levels in most samples. From this visual inspection, 
we can also see the importance of jointly considering both aspects of the 
data — entire regions of the phylogenetic tree appear dissimilar between the 
patients, such as the Bacteriodetes phylum (colored shades of blue) where 
patient A has much less abundance across all of his/her samples than the 
other two patients. 

Given the large number of species (395) as compared to the number of 
samples (21), we could reorder the species and find other sets of species that 
are also very different across the patients. However, the clusters defined by 
the phylogenetic tree provide biological information regarding the relation- 
ships among the species that is separate from the numerical abundances. 
Patterns of sparsity or differences among the patients following the clusters 
in the tree are generally of greater interest than an arbitrary grouping since 
there is known biological meaning to the groupings. The additional informa- 
tion found by using the phylogenetic similarities can serve as a check on the 
kind of relationships among the species that we are interested in. This will 
be particularly important since we have so many more species than samples. 
Focusing the analysis to follow the structure of the tree will allow for more 
meaningful results. 

This study was exploratory. It was the first sequence-based analysis of the 
bacterial composition of the colon that compared between individuals and/or 
locations of the sample (many genomic experiments of this type either sam- 
pled only one patient or pooled patients together). The list of phylotypes 
found and their relationship to known bacterial taxa was biologically infor- 
mative. In addition to creating an inventory, the goals of the experiment 
were to better describe the bacteria communities and their differences along 
the intestinal tract or between patients. With the small sample size, the 
analysis cannot extrapolate to the population in general but can only focus 
on describing the patients observed. 



E. PURDOM 




Fig. 1. Depiction of the abundance matrix from Eckburg et al. (2005). Columns indi- 
cate samples, grouped by patient, and rows correspond to different phylotypes. The grey 
scale indicates the level of abundance on a log scale (see legend for conversion to original 
abundances) . The colors on the phylogenetic tree indicate phylum, as in Eckburg et al. 
(2005), but with a different choice of colors: blue — Bacteriodetes, green — Firmicutes, pur- 
ple — various Proteobacteria, pink — Verrucomicrobia. We additionally colored two portions 
of the Bacteriodetes phylum (blue) separately: roughly identifiable as Prevotallae and B. 
vulgatus, they are colored lightest blue and darkest blue, respectively. Also, we colored the 
Firmicutes (green) with two different shades for B. Mollicutes and Clostridia (dark green 
and light green, respectively). 
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2.1. Effect of imperfect species definition. In practice, we cannot identify 
a bacterial species from the DNA sequence. Instead the sequences are them- 
selves used to define the species, based on the sequence similarity of different 
copies — for example, the rule in Eckburg et al. (2005) for grouping sequences 
into one "species" required all pairs in the group to have a minimum of 99% 
sequence similarity. For this reason, the term "phylotype" is used instead 
of species to indicate that these are merely proxies for the true species dis- 
tinctions. A phylogenetic tree for the phylotypes was built using maximum 
likelihood estimation of the tree [Felsenstein (1981)]. Specifically, the tree 
was built using a representative instance of the 16S rDNA sequence from 
each phylotype, generally a consensus sequence of those sequences classified 
into that phylotype. 

The possible effect of using an arbitrary cutoff for defining phylotypes is 
seen in Figure 1, where the length of the tree branch reflects the similarity 
between the species. Some phylotypes clearly form tight bunches of very 
similar phylotypes, particularly in the Clostridia family of the Firmicutes 
phylum (light green). If we had changed the cutoff for deflning phylotypes, 
we could imagine these groups collapsing into a few distinct phylotypes. 
Therefore, we need to be careful to have an analysis that is robust to such 
small changes and does not count each phylotype as equally important. 

The relationship between DNA sequences can be summarized in different 
ways, such as its similarity to other sequences, the phylotype to which it has 
been assigned, or its location in a phylogenetic tree built between different 
sequences. The analysis discussed in detail here will reduce the sequence 
data to the phylotype-level, ignoring the individual sequence data: each of 
the = 11,831 observations (or sequenced strands of DNA) belongs to one 
of S = 395 phylotypes (or species) and one of the L = 21 locations. 

3. Incorporation of additional information via inner-products. For ob- 
served data Xj G MP we propose to use nonstandard inner-products or metrics 
in analyzing the data. We argue that this is a simple way to include com- 
plicated outside information, such as graphical information, in the analysis 
of high-dimensional data. 

By nonstandard inner-products, we specifically mean an inner-product 
between two observations i and j given by (xj,Xj)Q =Xj"^Qyj. Since Q 
also defines a metric based on ||xj — XjHq, we may at times refer to Q 
as a metric. For any inner-product (•,•) and a fixed set of n vectors Xj, 
there exists a matrix Q so that (xj,Xj) = x^-^Qxj, so this is a quite general 
definition. A common example of such an inner-product is the Mahalanobis 
distance, where Q is chosen as the inverse covariance matrix of the observed 
random vectors [see Maesschalck, Jouan- Rimbaud and Massart (2000)]. In 
this case, the choice of Q = where is the covariance matrix of the 
observed variables, removes the correlation among the variables, also known 
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as "sphering" the data. This is the most common choice of a nontrivial Qp 
and is used, for example, in discriminant analysis for classification problems. 

The choice of an appropriate metric Q, however, can also be a method 
for including outside information. In particular, assume that the additional 
information, such as the phylogenetic tree, is such that one can model the 
covariance structure S for the variables that this information would imply. 
The resulting covariance matrix, 5], is not the covariance for the observed 
variables in our data — which is the result of a much more complicated re- 
lationship between the graph and the data — but rather what would be ex- 
pected if the data was completely created by this outside process. In order to 
evaluate the data so as to give priority to relationships in the phylogenetic 
tree, we propose using the metric Qp = S for the variable space. Performed 
in this space, the analysis focuses on the aspects of the data variables most 
congruent with the SI. 

Because most multivariate techniques are based on inner-products, they 
are easily generalized to a more general inner-product space. We will focus 
on PCA using Q, a technique known as generalized PCA (gPCA) or the 
duality principle [Escoufier (1987); Holmes (2008); Dray and Dufour (2007)]; 
Jolliffe (2002) gives a more in depth overview of gPCA, connecting gPCA 
with other techniques. We give a short review of gPCA before we discuss 
more fully the interpretation of this strategy. Other multivariate methods 
have been similarly extended and would be also relevant for incorporation 
of outside information. 

3.1. Generalized PCA. Quite generally, gPCA is an ordination proce- 
dure, that is, each observed data point x € is transformed to new, lower- 
dimensional data coordinates given by x G M'^ which is a linear transfor- 
mation of the original coordinates: x = Z"^x for some matrix Z G RP^'^. 
Most multivariate techniques are ordination procedures, common examples 
being PCA, Canonical Correlation Analysis and Correspondence Analysis. 
The differences lie in the choice of the linear transformation (Z), which is 
chosen based on the desired properties of the new, lower-dimensional vec- 
tor X. The most familiar example is standard PCA which seeks succes- 
sive vectors G W so that the resulting jth coordinate, x^^) = (x,aj), has 
the largest variance, subject to being independent of previous coordinates 
X(i), . . . ,X(^j_iy, the final transformation matrix is Z = = (ai • • • a^). 

The ordination procedure of generalized PCA (gPCA) is a generalization 
of PCA in that it assumes an alternative inner-product for the data vec- 
tors X. We assume an observed random variable x lies in MP with a known 
inner-product defined by Qp G M^^^'. Then in analogy with standard prin- 
cipal components, gPCA can be developed from the perspective of finding 
the vector a that maximizes the population quantity, var((a,x)Qp), with a 
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constrained to have unit Qp-norm and successive aj constrained to be Qp- 
orthogonal to the preceding slj, 



for X are then given by x = A^^QpX (so Z = A^Qp in the notation given 
above). As in PCA, the wih be eigenvectors, but now of the matrix 
where ^ is the covariance matrix of x. The matrix ^Qp is not symmetric, 
but because Qp is fuh rank, this is a weh defined, positive definite generahzed 
eigenequation, and the eigenvectors of *Qp can be chosen to be a Qp- 
orthogonal set of vectors [see Golub and van Loan (1996)]. 

Just as in PCA, there are multiple developments that result in the same 
ordination procedure. For example, gPCA provides the best /c-dimensional 
approximation to the inter-point similarities when the similarities are cal- 
culated in the appropriate metric space. In particular, we could note that if 
distances between observation i and j are given by 



then gPCA is equivalent to Multidimensional Scaling (MDS) of the n obser- 
vations based on these distances. Similarly, for any Qp, there exists a (non- 
unique) matrix C so that Qp = CC^, which means gPCA of x based on Qp 
is equivalent to first transforming the vector x by C and then performing 
PCA on the resulting vector C"^x. 

Metric for the columns. As an analysis of a n x p data matrix X, the 
above presentation only considered a metric for the space of row vectors 
(observations) of X. There can also be a relevant metric for comparison of 
the variables, a simple example being when there are weights assigned to 
the observations. Generalized PCA goes beyond the description given so far 
and allows also for a metric Q„ € M""^"" for the space of the column vectors 
of X. These combinations of choices are generally abbreviated as the triplet 
(X, Qp, Q„) [see Escoufier (1987) for a more general explanation of the role 
of two separate metrics when viewing X as an operator simultaneously in 
and M"]. We note that in many cases either Q„, or Qp are chosen to be diag- 
onal, in which case they simplify to weights on the observations or variables, 
respectively. 

Returning to the population development above, the inclusion of a metric 
for the columns of X is incorporated in the estimation of In order to 
maximize the quantity var((a, x)qp), we must estimate ^ from our data 
matrix X; we include the metric Q^ for the columns in our estimate so that 
^ = X^Q„X. Then our estimates of a^ are given by the eigenvectors of 
X^Q^XQp. A geometric development that includes the metric Q„ for the 
columns shows that gPCA best preserves the total inner-point similarities of 



where, again, A^ 




c^(^,j) = (xi - Xj) Qp(xi-Xj), 
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the data matrix X when using a measure of inner-point similarities incorpo- 
rating the row and column matrix known as the inertia (see Appendix C). 
In addition to the geometric view of gPCA, Jolliffe (2002) notes that gPCA 
with Q„ a diagonal matrix provides the maximum likelihood estimates of 
the fixed effects version of a factor model, 

X = Az -|- e, 

where e ~iV(0,(T2Q-iQ-i). 

Connection between analysis of the rows and columns. In some data set- 
tings either the rows or the columns can be meaningfully considered as the 
observations, such as analysis of large contingency tables that are our mo- 
tivating example. Furthermore, the importance of the different variables in 
describing a low-dimensional representation of the observations is a common 
part of PCA. A gPCA of the columns of X, also reduced to k dimensions, 
is technically the gPCA of triplet (Y = X-^, Q„, Qp) and results in new co- 
ordinates for the columns given by Y = YQ,jBfc G M^^*^. 

Again in analogy to PCA, a generalized form of the SVD of X yields 
the solutions to gPCA on both the columns or the rows simultaneously. 
If the rank of X = r, we can write X = BA^^^A"^, where A E M^^'' and 
B G M"^^, and the columns of B are Qn-orthogonal and the columns of A 
are Qp-orthogonal. Then B gives the solutions to the gPCA of the columns 
as observations, while A gives the solutions to the gPCA of the rows as 
observations. The corresponding eigenequations are 

X^Q„XQpA = AA, 

XQpX^Q„B = BA, 

— 1/2 

and for any choice of k, B/^. = XQpA^A^ , where {■)k refers to the matrix 
with the first k columns or diagonal elements, as appropriate. 

This means the new coordinates from a gPCA of the rows can be com- 
pletely determined by the new coordinates from a gPCA of the columns of 
the data matrix. Let x E M'"' be the new coordinates for a vector x G 
based on the gPCA of the rows of X. The new coordinates are given as 

x^ = x^QpYA-^/l 

Put another way, the value of the j new coordinates of x is given by 

% = (x,Xj>Qp, 

— 1/2 

where Xj is the jth column of YA^ , that is, the column of Y normalized 
to have standard deviation one. Thus, the jth coordinate of x is a measure 
of the similarity of x with the jth variable defining the reduced space of the 
columns. 
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3.2. Interpretation of nonstandard metrics. Using a metric for W has an 
obvious rationale when the metric is a diagonal, implying different weights 
for different variables, or when the metric is where ^ is the covariance 
of the variables (Mahalanobis distance). However, it is not immediately clear 
why a particular matrix Qp, such as Qp = S as we propose above, would 
improve a given data analysis. One intuitive rationale for this comes from 
thinking of the metric as defining a harmonic analysis of the data in the 
direction of the eigenvectors of Qp. This is the perspective of Rapaport 
et al. (2007) in their proposal for the particular case of general graphs (see 
Section 7). 

Outside information, such as our phylogeny, when represented by Xl also 
defines a basis given by the eigenvectors vj of S. The eigenvectors decompose 
our overall covariance into hopefully informative directions with regards to 
our outside structure, and the Vj can be ordered based on their overall 
contribution to Xl based on the eigenvalues Xj. The directions given by 
the Vj can be weighted in different ways to create a family of metrics, with 
each choice of weighting system emphasizing different directions. 

More precisely, suppose 5] has an eigendecomposition given by VAV^; 
V is a p X p matrix with columns vj consisting of the eigenvectors of 5], 
and A is a diagonal matrix of eigenvalues Xj. The vectors Vj form a basis 
for MP and, therefore, a data vector x can be written as 

X = ^(Vj,x)Vj = Vx, 

i 

where xq) = Vj-'^x gives the magnitude of x in the direction of the eigenvec- 
tors of 5]. 

This decomposition of x into its contributions due to the directions given 
by Vj creates no loss of information, being only a change of basis. But we 
can transform the original x by giving weights W(j) to different directions 
in order to give more emphasis to the features that Vj represents, in which 
case we now have a new vector G W with 

fw(x) = ^W(j)X(j)Vj = VDwX, 

j 

where is the diagonal matrix with diagonal given by w. For example, if 
our outside structure could be represented in a smaller subspace so that S 
had rank r <p, then defining w^j) = l{j < r} would give fw(x) as the pro- 
jection of X onto the smaller subspace defined as relevant by our outside 
structure. More generally, the eigenvalues Xj quantify the contribution of 
a direction vj to our outside structure S, and, therefore, the eigenvalues, or 
a monotone transformation of them, are a smoother way to assign relative 
importance to the different basis defined by S. 
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For two vectors x and y, the standard inner-product between fw(x) 
and fw(y) is given by 

(fw(x),fw(y)) = (x,y)vD2,vr, 

that is, the inner-product between x and y using the metric VD^V"^. Then 
the choice of a metric Qp = Xl is equivalent to the choice of weighting each Vj 

by and fw(x) = Q^^x. 

In this hght, we can compare the effect of using Qp = 5] versus Qp = 
Both obviously have the same eigenvectors and differ only in the weighting 
the eigenvectors {Xj versus 1/Aj). Thus, the choice of S as the metric for the 
variables places emphasis on the directions with more information about the 
outside structure, while emphases directions that are most independent 
of the outside information. Depending on whether this outside structure 
is thought to enlighten or confound the analysis, the different weighting 
systems are appropriate. 

From this harmonic perspective, the behavior of the eigenvectors is quite 
revealing as to the intuitive interpretation that can be placed on the analy- 
sis. Such a projection onto a relevant set of basis is, of course, analogous to 
harmonic analysis or wavelet analysis for functional data. PCA could also be 
described similarly, only with the Vj dependent on the observed variability 
of the data. In these cases, the basis functions can be ordered to hopefully 
reflect increasingly less meaningful variations of the data, so that the im- 
portant information in the data for the analysis in question is captured in 
the first few directions. More generally, eigenvectors of a covariance matrix 
describe linear combinations of decreasing variance, and thus presumably 
decreasing ability to reveal the structure of interest. 

Beyond the ordering of the eigenvectors, a desirable behavior for the pur- 
poses of interpretability is for the bases (eigenvectors) to be sparse — nonzero 
in a small portion of the coordinate space (or, more generally, a clearly inter- 
pretable subspace). If so, the resulting coordinates of the transformed data 
are easily interpreted as contrasts or combinations of a small set of variables. 
This is the appeal of wavelets or various sparse PCA algorithms. From the 
point of view of our outside information in the form of a graph or phyloge- 
netic tree, this means we want our representation of the outside information 
(via 51) to result in eigenvectors that are interpretable decompositions of the 
external information we have. As we will see, certain covariance structures 
for phylogenies and also graphs have such decompositions, which is one rea- 
son that the analysis in a nonstandard inner-product space can give highly 
interpretable results. 

3.3. gPCA and analyses of variables as observations. Another interpre- 
tation of Qp slightly different from the geometric one given above is that it 
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is simply an additional data matrix — one that defines similarities between 
the p variables — which we wish to include into our analysis of the primary 
data matrix, X. 

Pavoine, Dufour and Chessel (2004) accomplish this by their method of 
Double Principal Coordinates Analysis (DPCoA), which explicitly trans- 
forms the similarities between the variables given by Qp into a set of stan- 
dard Euclidean coordinates, Z G M^'^'', using MDS (also known as Prin- 
cipal Coordinates Analysis). This can be viewed as giving an alternative 
basis for and Z as the new set of coordinates of the original p variables 
in which X was measured. Then the next step of DPCoA transforms the 
data X to this new basis as well, that is, to coordinates XZ. DPCoA then 
performs PC A on the transformed X (we note that these steps are exactly 
the same as the steps of DPCoA, but generalized here to apply to general 
data matrices X and not just the contingency tables originally proposed; see 
Appendix A for details). 

The series of steps that make up DPCoA is exactly equivalent to a single 
gPCA of the centered data matrix, X, with the choice of metrics given by 
the triplet (X, Qp, Qn), provided that (1) the centered data matrix of X was 
the result of centering the columns (variables) and (2) the same centering 
matrix used in centering X was also used in the MDS of Qp to find the ma- 
trix Z (Appendix A). DPCoA was only proposed for the particular setting 
of ecological studies where the data is a contingency table, and, thus, cen- 
tering the columns of X is actually equivalent to centering the rows because 
of the row and column weights that are typically chosen for the centering 
(see Section 4.2), so the requirement is naturally satisfied. 

By recasting DPCoA as a gPCA, the technique now has general applica- 
tion and is clearly extendable, since in many situations heterogenous infor- 
mation can be similarly introduced into an analysis in this way. 

We note that MDS is traditionally described based on an input of squared 
dissimilarities or distances between points given by ap matrix 6; however, 
any positive definite Qp that can be written as 

Qp = lpv^ + vlj-i<5 

for some vector v G will result in the same MDS of the variables and thus 
the same DPCoA results. 

Another approach to analyzing two sources of data are multivariate ker- 
nel techniques, such as kernel CCA [Bach and Jordan (2002)], which assume 
that the only knowledge of the data is similarities between objects. In these 
techniques, two sets of data provide two different sets of kernel similarity 
matrices Ki and K2 on the same set of n objects, and the kernel analysis 
results in new coordinates yi and y2 that are linear combinations of these 
kernel similarities that best relate the two data sets (the prediction context 
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is also possible). Then gPCA of the rows of X results in equivalent coor- 
dinates for the rows as the choice of Ki = XQpX-^ and K2 = Qn, for an 
extreme form of regularization of the CCA problem that only constrains 
the norm of the resulting functional, rather than the more common 

constraint on estimated variance (see Appendix B). 

In the current setting, we are instead interested in outside information on 
the p variables in the form of Qp. In this case, the natural kernel analysis 
would provide new coordinates for the p columns based on Ki = X^Q„X 
and K2 = Qp, which would correspond to a gPCA of the columns. As we 
noted above, however, the row coordinates from a gPCA of the rows are 
recoverable from the gPCA of the columns. Like DPCoA, this perspective 
of gPCA is that of finding a new set of coordinates for the variables, based 
this time on explicitly relating the expected similarities to the observed 
similarities, and then rotating the matrix X into this basis. 

4. Analysis of species abundance. The investigation of species compo- 
sition and comparison of species across different locations, such as in our 
motivating example of the bacteria communities, form the core of ecological 
studies. A large contingency table of species abundances for different loca- 
tions is a common form of data in this literature. Development of gPCA as 
described here has often been in this setting, thus it is useful to review some 
important points before returning to our bacteria example. 

Our motivating example of the bacteria is ecological, but large contin- 
gency tables appear in many other situations. For example, in document 
classification, the data could consist of the frequency of different words in 
different documents. Another example is allele frequency studies with the 
frequency of different alleles of a gene in different populations. We will con- 
tinue to focus our notation and discussion on the phylogenetic/ecological 
scenario, but the methods presented here could be of use for these different 
data types. 

4.1. Notation. Assume that the abundance of certain species are mea- 
sured at L different locations and a total of S distinct species types are 
observed. We drop the use of n and p for the rows and columns of our data 
matrix to emphasize that there is not a canonical dimension that is consid- 
ered the observations in this setting, though we will focus on the locations 
as observations in our example. We will similarly use matrices Q5 and Ql 
for the row and column metrics. 

Let A be the resulting Lx S contingency table of the observed abundances 
of species s at location i. Because we are interested in comparing the species 
composition of the locations, we will represent each location by the relative 
proportion of the species in the location. A vector of relative proportions 
at location i is called a profile vector in the ecological literature and is 
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obtained by dividing each row of A by its row sum. The corresponding data 
matrix is given by X G M^^"^. Namely, let v^l = Al/A^ G be the row 
sums of A normalized to sum to one. Then X is given by 

X = I : I =D^iA/iVGM^x5, 
Vxl/ 

where D^^, is a diagonal matrix with diagonal elements given by wx, respec- 
tively. 

The vector also defines weights for each of the locations, and the 
weights are proportional to the total number of observations in that location. 
The weighted mean of the locations, x, is given by X-^'w/, and the centered 
data matrix, X, is given by X = (I — lw|^)X. 

4.2. A few important properties of contingency tables. The duality of 
rows and columns. Note that the weighted mean, x, also sums to one and 
therefore is itself a potential location profile. In fact, x is proportional to 
the column sums of A and thus is equal to the relative frequency of the 
species across all locations. If we had instead chosen to analyze the columns 
(species) as the observations, choosing weights W5 for the species in the 
same way as the rows, we would have = x. 

The equivalence of W5 and x has interesting repercussions for data analy- 
sis because under these weighting schemes, we can equivalently center either 
the rows or the columns, 

X = Pwj^X = XPwg, 

where Pw^ = (Im. — ImW^) is the projection matrix that centers m obser- 
vations based on a weighted mean with as weights. 

Interpretation of variables in gPCA. Because we analyze location profiles, 
there is a simple way to plot the variables (species) jointly with the obser- 
vations (locations). Let be the standard basis vectors of M*^. Then is 
also a profile vector representing a theoretical location that consists solely 
of species s. If we transform the data with an ordination technique, we can 
jointly transform and plot its transformation alongside the observed lo- 
cations. Unlike the usual plots of variables, the coordinates of our rotated 
axes have a meaning as a data point, not just as a direction in space, so 
we can legitimately visualize distances between the location and species in 
a single plot. 

Examples of gPCA with contingency tables. In addition to DPCoA de- 
scribed above, different metric spaces are often used for analyzing contin- 
gency tables via gPCA, particularly to retain additional information such as 
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the weights and/or w^. The most common example of gPCA is Corre- 
spondence Analysis (CA), which is a gPCA of the row profiles of a contin- 
gency table, and uses the triplet (X,D~^,Dw£) [see Greenacre (1984) for 
a detailed treatment]. This gives an inner product of the form x|'D~^x^, 
down- weighting the more frequent species. This can be seen as counteracting 
a "size effect" for frequencies, where abundant species dominate the anal- 
ysis; without this correction, differences in rare species (which will be on 
a smaller order of magnitude) are lost. 

One can argue that the weighting of CA places too much importance 
on low abundance species, even though those species are more likely to be 
miscounted and are probably less trustworthy. Gimaret-Carpentier, Chessel 
and Pascal (1998) propose no weighting of the species, only the locations, 
which gives a triplet (X,l5',Dw^) — ^just a regular PCA with weights on 
each observation. Such an analysis in ecology is also called Non-symmetric 
Correspondence Analysis (NSC A). 

4.3. Connection to diversity. We take a moment to comment on the 
connection of the choice of gPCA metrics to a common question in ecology — 
how "diverse" a location is. Diversity is a measurement of how close the 
distribution of species is to uniform. Two popular measures of diversity are 
variations of the Gini-Simpson index, f/'Gs(x) = 1 — X^f=iX(s)^, and the 

Shannon Diversity index, ii'sh(x) = X]f=i ^(s) log(x(s))- 

Ecology studies often use the individual diversity of locations to make 
comparisons, but the diversity indices alone do not effectively compare the 
species composition. Locations can have quite different composition of species 
but with same levels of individual diversity. Of interest is how the species 
composition changes, and ordination techniques are used to address these 
problems, but as a separate component of the analysis of the ecological 
data. However, the choice of diversity and the choice of gPCA parameters 
are closely connected, as pointed out in Pelissier et al. (2003). Namely, if Q^, 
is a simple diagonal matrix of weights on the locations, gPCA of (X, Qs, Ql) 
gives the best representation of a particular measure of dissimilarity between 
locations, and choice of this dissimilarity measure implies a diversity mea- 
sure, and vice versa. Pelissier et al. (2003) stated this for several specific 
ordination techniques, and we state it more generally for any choice of met- 
ric Q5 on M"^. Define diversity and dissimilarity measures for any positive 
definite matrix Qs = Q by 

i?Q(x) =x'^diag(Q) -x^Qx = ^X(r)Q(r^) -J2Q{rs)^{r)^{s), 

r rs 

DissQ(xfc,Xj) = (xfc -X£)^Q(xfc -x^). 

These are clearly closely related to the norm and inner-product defined with 
the choice of Q. With these choices of diversity and dissimilarity, the total 
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diversity across all locations is given by Hq{5(.) and can be decomposed into 
the average diversity of individual locations and plus the average of pairwise 
dissimilarities of locations, 

L L L 

^q(x) = l/2X]X]^iW^i^W DissQ(x£,X£) + ^w^(^)ifQ(x^). 

k=i i=\ e=i 

^Total ctwccn ^Within 

gPCA of (X,Q,Dw^) gives the best low-dimensional representation of /b, 
the average dissimilarity between locations (see Appendix C). 

We can define a F-style statistic, as in AN OVA, to test for significant 
dissimilarity between the locations [Legendre and Legendre (1998)] 

^^ (A^-1)/b _ 

Because the significance of F will generally be determined by permutation 
tests, this F-test is functionally equivalent to using I^/It, which has many 
appealing connections to standard measures. We describe a few of them 
below given originally by Pelissier et al. (2003) and Pavoine, Dufour and 
Chessel (2004): 

CA: For correspondence analysis, Q = results in a dissimilarity be- 
tween profiles measured by the distance, 

(xfc - X£)'^D;;^(xfe - x^), 

which has also been proposed for document classification. As is well known 
in CA, Ib = X^/A*", where is the x^-statistic for testing independence. 
The implied diversity measurement for a profile x is ^ Wg(^)X(,^)(l — X(^,)), 
which implies the total diversity It is simply S — 1. Thus, Ib/It is pro- 
portional to the statistic. 
DPCoA: As we saw before, DPCoA can be written in terms of a general 
Qs. If we write Qs = IpV-^ + vl^ — ^6 for some v G M"^ and species 
dissimilarities 6, as in Section 3.3, then we have that //q and Dissq are 
the Rao diversity and dissimilarity measures [Rao (1982)] given by 

-f^Q(x) = y^^(rs)X(r-)X(s), 

DissQ(xfc,Xj) = (xfc - yiif {-\8){-Kk - ^e)- 

Thus, gPCA with Qs results in differences between locations profiles be- 
ing down-weighted for the species that are similar to each other and up- 
weighted for very distinct species. Though stated in many individual steps 
and not a single gPCA as we do here, the DPCoA method was motivated 
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by searching for an ordination that maximized this notion of distance 
between observations. The ratio I-q/It is commonly called the FsT statis- 
tic [Martin (2002)] in biological applications and has been suggested for 
testing differences in bacterial communities, where 5 is usually chosen 
as the original measures of genetic distance between the sequences. The 
FsT statistic is also used in testing for differences of allele composition in 
human populations [Excoffier, Smouse and Quattro (1992)]. 
NSCA: Since NSCA is standard PCA, except for the weighting of the ob- 
servations, Q = I, and is equivalent to the Rao diversity and dissimilarity 
measures when all the species are equally distant from each other. The 
resulting measure of diversity in this case is the Gini~Simpson measure 
of diversity, -ffcs- The ratio Ib/It is equivalent to Kendall's r [D'Ambra 
and Lauro (1992)]. 

5. A metric for species related by a phylogenetic tree. Returning to our 
bacteria example, we want a matrix XI that represents the phylogenetic re- 
lationships of the species. As mentioned in Section 3, if we can model the 
covariance structure of data expected based on just our outside information, 
this provides a natural choice of Xl. The phylogenetic tree in fact is a repre- 
sentation of the process of evolution, for which many possible probabilistic 
models could be created. 

A common probabilistic model for the evolution of the value of a trait 
over time, due to Cavalli-Sforza and Piazza (1975), is one of a Brownian 
motion model over time, where at each speciation event the model assumes 
that the resulting sister species continue to evolve independently [for alter- 
native models of evolution, see Hansen and Martins (1996); Pavoine et al. 
(2008)]. This model gives a covariance structure for the trait as observed on 
the existing species (the leaves of the phylogenetic tree) and can be simply 
stated in terms of distances between species on the phylogenetic tree. More- 
over, the eigenvectors of this covariance matrix generally demonstrate nice 
localization properties relative to the tree, implying interpretable results in 
terms of the properties of the tree. 

Specifically, assume that there is a known phylogenetic tree describing 
the ancestral relationship of S extant species and that a trait of interest for 
these species has evolved over time according to the model of independent 
Brownian motion with the speciation as depicted on this tree. The S extant 
species are observed, and for each species s at a single time point t(5), the 
trait is measured, resulting in y^s)- Then the vector of trait values, y, follows 
a multivariate normal distribution with covariance between species r and s 
proportional to the total length of time that the evolutionary history of 
the two species were identical, cov(y(g), y^,^)) =a'^trs, where trs is the time 
at which the two lineages diverged, as measured from their most common 
ancestor. 
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We can write this covariance quite simply in terms of the topology of 
the tree and the length of the branches, assuming that the branch length is 
reflective of evolutionary time. Let S be the distance matrix of the leaves 
based on the distance of the shortest path between them on the tree. Then 
we can write the covariance matrix Xl as 

S = l/2(lt'^ + tl^-(5), 

where t G M.^ is the vector of the distance of each species to the root. 

This relationship between I] and S implies that gPCA with S as the 
species metric will decompose a Rao Dissimilarity, with dissimilarities be- 
tween species given as their distance on the tree. For the bacterial example, 
use of this distance has the effect of not declaring locations very different 
if the differences between locations occur in phylogenetically similar phylo- 
types. 

Properties of phylogenetic metric. We would like that the eigenvectors 
of I] be sparse in a useful way relative to the structure of the tree, for exam- 
ple, that they contrast sister subtrees of the phylogenetic tree and be zero 
elsewhere. Furthermore, we would like that eigenvectors give increasingly 
specific level of detail so that eigenvectors corresponding to larger eigenval- 
ues highlight deeper structure in the tree. Put together, these statements 
would imply that the eigenvectors offer a multiscale analysis of the tree, 
with eigenvectors corresponding to large eigenvalues interpretable as sum- 
marizing differences in the large initial partitions of the tree and smaller 
eigenvalues giving eigenvectors reflecting the distinctions between the later 
divisions of the tree. 

Several authors in phylogenetics have asserted that the eigenvectors of I] 
have this multiscale structure [e.g., Cavalli-Sforza and Piazza (1975); Rohlf 
(2001); Martins and Housworth (2002)], but only limited statements of this 
kind can be rigorously made about a phylogenetic tree with more than four 
leaves/species [see Purdom (2006) for a longer discussion]. But empirical 
observations of the eigenvectors show that they often do have some charac- 
teristics of this multiscale property; for example, S has a block structure 
which guarantees that the eigenvectors of S will, at a minimum, be nonzero 
for only one side or the other of the initial split in the tree (Appendix E). 
Beyond this, if we ignore the comparatively small values in the eigenvec- 
tor, eigenvectors corresponding to smaller eigenvalues do tend to divide the 
species into smaller and smaller closely-related groups based on the sign of 
the entries, though the groups do not exactly correspond to subtrees (see 
Figure 2). 

6. gPCA applied to bacterial data and phylogenetic tree. In Eckburg 
et al. (2005) our original analysis of the bacterial data was a gPCA of 
(X, 5],Dw^), which is equivalent to DPCoA choosing d to be the distance 
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Fig. 2. An illustration of some eigenvectors of S for the intestinal data. Only 25 of the 
395 eigenvectors are shown: those that correspond to the first five largest eigenvalues, the 
last two smallest eigenvalues, and then a random sample in between. Each row represents 
an eigenvector, and the value of each element of the vector is plotted alongside the phylotype 
with which it corresponds. Blue represents a positive value, red a negative. The width 
indicates the absolute value of the element. Again, each row has been normalized so that 
the maximum width is the same in each row. Next to each row is printed the corresponding 
eigenvalue. 
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Fig. 3. Scatter plot of the species and samples with the first two coordinates given by 
DPCoA. Species are shown as colored points in both plots. In plot (a), the samples are 
shown as the large blue shapes: different shapes indicate different patients and different 
shades of blue indicate location within the colon. In plot (b) , samples are represented as 
ellipses that indicate the major directions of the abundances of the samples. For simplicity, 
a single ellipse for the combined abundance m the biopsies is shown because the internal 
biopsies are very similar. 



among the phylotypes. We display in Figure 3 the ordination of the locations 
(samples) and species using the first two coordinates (using the implemen- 
tation of DPCoA in the ade4 package in R [Chessel et al. (2005); R De- 
velopment Core Team (2008)]). The first obvious fact is that the patients 
are separated, almost entirely, by just their value when projected onto the 
first axis. The first axis orders the patients B, C, A, which correlates with 
visual examination of the data in Figure 1. Below we will compare to other 
common choices of metrics and we will see that distinguishing the patients 
is not difficult since all of the techniques accomplish this, though not always 
in just one dimension. More interestingly, we also see in Figure 3 that the 
stool samples are distinguished from the internal biopsies of the colon, and 
the second axis seems to make this distinction. Again this makes sense from 
visually examining the data, since within each patient the stool samples do 
stand out from the biopsies. 

The most striking aspect of the plot from the gPCA is the additional 
information provided from the inclusion of the phylotypes in the plot. Recall 
that when our data matrix X consists of profile vectors, our original axes 
correspond to a location that is entirely concentrated in phylotype s. The 
coordinates of the phylotypes given by gPCA will be the coordinates of our 
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axis e<j centered and rotated like the observed profiles (see Appendix A). 
Looking at the ordination plot, we see that the phylotypes' coordinates 
provide an interpretation for the first two dimensions. The phylotypes are 
in clusters much like the groupings on the tree — not surprising if we recall 
that in the full space the distances between the species are exactly the 
distances on the tree. What is interesting is how the clusters on the tree 
fill the space once projected into these two coordinates that preserve the 
Rao Dissimilarity among the locations. The distribution of the phylotypes 
indicate the importance of these clusters in determining the dissimilarity 
between the patients. Those far from the origin have more impact in defining 
the coordinates of the locations. We see the tension between the various 
Bacteroides (blue) and the rest of the tree. 

Furthermore, we can interpret the relationship between the locations and 
the phylotypes. We see that patient B is comparatively much more in the 
direction of the Prevotallae-like bacteria (light blue), while the other two 
patients are more in the direction of the B. Vulgatus-like phylotypes (dark 
blue). Similarly, the biopsies are comparatively more heavily represented in 
the Bacteroides (blue) portion of the tree, while the stool samples are com- 
paratively less so. Figure 3(b) depicts the different samples as ellipses with 
the axes of the ellipses determined by the relative proportion of the different 
species for the location (see Appendix F). This illustration emphasizes that 
the samples can be thought of giving weights to each phylotype, and the 
ellipse demonstrates the relative influence of the different species. We see 
graphically the different influences of the two groups of Bacteriodes (blue) 
in separating the biopsies of patient B from all of the rest of the samples. 
Transforming the data in various ways before analysis does not dramati- 
cally change these relationships (e.g., log-transforming the data or adding 
pseudo-counts) . 

All of these visualizations have, by necessity, focused on only the first 
two dimensions of the coordinates given by gPCA. These dimensions do 
cover a large proportion of the Rao Dissimilarity, but still are only an ap- 
proximation of the full space. We are mainly focused on demonstrating the 
characteristics of the ordination procedure in terms of the coordinate sys- 
tem that it creates, but for more rigorous testing of differences between the 
patients or between the biopsies and stool samples, permutation tests based 
on the F-statistic described above would generally want to compare with 
the entire coordinate system. 

6.1. Comparison to other approaches. How do these results compare to 
the other ordination techniques mentioned above? In Figure 4 we show 
the results of the ordination from Non-symmetric Correspondence Analysis 
(NSCA), Correspondence Analysis (CA) and a Mahalanobis-like distance 
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Fig. 4. Coordinates of species and samples from alternative ordination techniques. 



based on S~ (see Section 5). We similarly center, rotate and project the 
axes to get the species coordinates in the same manner as DPCoA. 

As we mentioned, all of the techniques separate the three patients, but we 
see that the gPCA using the tree gives much more relevant results, both in 
terms of the role of the species and in relating to our intuitive interpretation 
of the data. The NSCA [plot (b)] is the same technique as our gPCA but 
with each species at equal distance from every other; it is also just a stan- 
dard PCA with weights on the observations. In the first two coordinates of 
the NSCA, we see that instead of having a smooth contribution from clus- 
ters of phylotypes, two individual phylotypes, far removed from the rest, 
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contribute to the division of the patients much more than the rest. The bulk 
of the species have httle contribution to these coordinates. Thus, there is 
httle from which to draw more general conclusions regarding the biological 
characteristics of the species which are influential. This is a consequence 
of treating each phylotype equally, rather than using the additional struc- 
ture of the tree to shape the analysis. CA [Figure 4(c)], on the other hand, 
spreads out the importance of each phylotype. Here we can see the effect 
of the down-weighting metric in CA discussed earlier; differences found in 
the many low abundance phylotypes are allowed to influence the analysis. 
Rather than a couple of phylotypes dominating the analysis, as in NSCA, 
the phylotypes play more equal roles. 

We might try to use any one of these techniques to reason out relationships 
among the variables. Each technique would give a different story in the role 
of the variables (phylotypes) dependent upon the assumptions inherent in 
the method. The relevant feature for our analysis is that we presuppose that 
a certain type of information is relevant — namely, how the structure of the 
tree relates to the data. This approach focuses the analysis on finding an 
interpretation among the variables that follows the tree structure. 

We note that the abundance table from metagenomic studies discussed 
here has many features common to high-throughput experiments in biology — 
in particular, the number of biological samples is quite low compared to the 
number of measurements. We sought to integrate the phylogenetic informa- 
tion into the data analysis a priori. In this way, the analysis is constrained 
in a biologically relevant direction. In contrast, we could think of analyzing 
this abundance data much like a microarray experiment: test each phylo- 
type individually for differences between the patients and use multiple test- 
ing criteria to identify individual phylotypes showing significant differences. 
A problem with this approach, which is also a common problem in mi- 
croarrays analyses, would be that a list of significant phylotypes is difficult 
to interpret. In microarray studies, biological interpretation is often done 
a posteriori by then examining biological knowledge of the list of genes. We 
could similarly use the phylogenetic tree in this way. However, we just saw 
that an analysis independent of the tree highlighted only a couple of specific 
phylotypes from which it would be difficult to build a general connection to 
the tree. 

6.2. Effect of the choice of metric. We can see the effect of using I] 
in our gPCA by examining the linear combinations that gPCA using I] 
chooses. For any ordination technique, let V be a matrix that rotates the 
original profiles X to give us the final ordination; in gPCA of centered 
data, this will be the matrix PwgQs'A. We examine the different linear 
transformations, Vj, from gPCA with Xl as compared to the transformation 
for a standard PCA on the data X (equivalently, NSCA). And we also 
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Fig. 5. Shown are the first five linear combinations of gPCA using S that act on the 
observations in X (the location profiles) to create the first five coordinates (vi). The five 
dimensions are divided by thick, dotted line. Also shown adjacent to each gPCA vector are 
the linear combinations from a standard PC A of^ (labeled 'X') and the eigenvectors ofS, 
(labeled 'T,'). 

compare to the eigenvectors of if the covariance between the species was 
exactly the 5] predicted by the evolution model, then these would be the 
principal components of such data. Thus, we can think of the eigenvectors 
of 5] as PCA on the tree. 

In Figure 5 we order the elements of v, from these three ordination tech- 
niques so that they line up with the phylogenetic tree. In this way we can see 
the relative importance of the phylotypes in transforming the data. When we 
look at the linear combinations for the first few coordinates, we see that the 
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principal components from our gPCA with S intuitively seem to be a trade- 
off between these two options, and we could think of this as a shrinking 
of the data variability in the "direction" of the tree. This is a particularly 
appealing idea, since we are treating the phylotypes as variables and there 
are far too many variables for the number of samples we have. 

Despite the intuitive results, the analysis depends on our choice of en- 
coding the tree using S (or, equivalently, for DPCoA, our choice of d). In 
particular, the block structure of S puts large emphasis on the first initial 
partition of the species at the root of the tree; these two groups of species 
are considered independent, conditional on the root ancestor. We can see 
this emphasis on this first divide from the Rao Dissimilarity based on 6, 
where these two lineages will be far away from each other and, thus, dif- 
ferences between will be accorded more weight in the analysis. However, as 
mentioned above, we see that the method depends only on S, so the defini- 
tion of the root of the tree, per se, is not the deciding factor, but rather the 
large amount of distance between these two subtrees. 

Changes near the tips of the tree, both in the numerical data and the 
definition of the tree, will have little impact on the gPCA. For the bacterial 
data that we are interested in, the deeper tree structure is more trustworthy 
than the structure near the leaves of the tree because of the approximate 
definition of species. It is a reasonable compromise to put more weight on 
the deeper structure of the tree, and base our analysis on this dependence, 
in exchange for resolving the more fundamental problem in our definition of 
the species. 

7. General graphs. It is clear that the same approach is applicable to 
other situations where there is complicated information that is related to the 
experimental data. By understanding our phylogenetic analysis as a specific 
example in a general approach to data analysis, we can compare with other 
techniques as well as take advantage of insights from other data situations. 

A closely related example is when we have not a phylogenetic tree, but 
a more general graph structure that describes the relationship of our vari- 
ables or observations. The analysis of experimental data in tandem with 
related biological networks by Rapaport et al. (2007) is equivalent to our 
metric approach. There, the authors used the Laplacian matrix associated 
with a graph to represent the biological graphs that related genes, where the 
the laplacian matrix L is given by Dd — A, where A is the adjacency ma- 
trix of the graph and d is the vector of degrees of each node. The Laplacian 
matrix is a natural choice for graphs; the eigenvectors have similar multi- 
scale properties as our metric for the phylogenetic tree. In Appendix D we 
briefly discuss the possibility of treating the phylogenetic tree as a general 
graph and using the Laplacian as a metric. We chose another approach here 
because such a choice does not well reflect the phylogenetic information in 
the tree. 
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A related application is found in spatial analysis, where spatial relation- 
ships between observations are based on neighborhood relationships, or, 
more generally, distances between points. While many analyses first remove 
the spatial dependencies so as to have independent observations, it is of- 
ten also of interest to evaluate the relationship between the spatial patterns 
and the observed data. When spatial connectivity between observations is 
simplified to a zero-one connectivity measure (usually based on a cutoff' on 
the distance between the observations), the spatial relationship is given by 
an adjacency matrix. Geary's c and Moran's I, two common measures of 
the spatial autocorrelation of y G M" (a variable observed on the n observa- 
tions), can be written in terms of the adjacency matrix [Thioulouse, Chessel 
and Champely (1995)], 

_ {n-l) EU EU Afa-)(yco - yp-))^ _ n - 1 y^(Dd - A)y 

2iv^E.(yw-y)' Ne Fy 

^ ("-) E"=i E"=i ^(ij)iy{i) - y)iy{j) -y) ^ n y^Ay 
2iVeE.(yw-y)' Fy ' 

where y = y — yln is y centered by the standard (unweighted) mean of the 
elements of y and = Eij Aij is twice the number of edges in the graph. In 
particular, we see that Geary's c can be written in terms of an inner-product 
using the Laplacian. 

Thioulouse, Chessel and Champely (1995) note that the variance of y with 
observations weighted by their node-degree, given by varo^ (y) = FDdy/-^e) 
can be decomposed into related components, 

varD,(y) = y ^^y + y j^y- 

Thus, Geary's c and Moran's / are similar to F measures described above, 
that is, the ratio of component variability to total variability (note, however, 
that Moran's / can be negative). Several authors have proposed spatial 
multivariate analyses which rely on L as a metric for the rows, or a row 
standardized version L* = D^^(Dd — A) [Aluja-Ganet and Nonell- Torrent 
(1991); Thioulouse, Chessel and Champely (1995); di Bella and Jona-Lasinio 
(1996); Dray, Said and Debias (2008)]. [Note that the matrix L* as well as 

- — ' _]^/9 1/2 

the similar matrix L = (Dd — A)Dj are also considered in graph 

theory; see Biyikoglu, Leydold and Stadler (2007).] 

8. Conclusion. There is a clear necessity for including phylogenetic in- 
formation in an analysis of metagenomic data. gPCA gives a simple and 
compelling way to accomplish this. We also see from our recasting of DP- 
CoA as a gPCA that the framework of gPCA allows for easy comparisons 
between seemingly disparate analyses as well as further exploration as to 
the effect of our choice of metrics. 
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The use of nonstandard metrics is quite natural in statistics and can be 
implemented in a variety of ways, PCA being merely the simplest. Common 
examples, such as Mahalanobis distance, are usually data-driven, but we see 
that metrics based on outside knowledge can be used to include complicated 
and heterogeneous information into the analysis of our numerical data. This 
kind of information can help to give more context to the data, particularly 
when the number of variables is large as compared to the samples. More- 
over, since the metrics here correspond to covariance matrices, probabilistic 
models give a simple approach for encoding information appropriately. Of- 
ten, as in the case of phylogenetic trees, the eigenvectors of such covariance 
matrices have nice localization properties that highlight the relevant spatial 
or regional patterns of the prior information. 

APPENDIX A: DPCOA AND GPCA 

We state here the equivalence between DPCoA and gPCA described 
in Section 3.3. First we describe more explicitly DPCoA, as described in 
Pavoine, Dufour and Chessel (2004). 

DPCoA. Assume that the squared pairwise distances/dissimilarities be- 
tween the species are given by a 5 x 5 matrix S. We also assume that the 
distances are Euclidean (i.e., coordinates can be found for the points so that 
the standard Euclidean distance between points is given by the square-root 
of the entries of 5). 

Following the notation provided in Section 4, let Pw,„ = (Im — ImW^) 
be the projection matrix that centers m observations based on a weighted 
mean with w.^ as weights: 

1. Find Euclidean coordinates of the species using a weighted version of Mul- 
tidiminsional Scaling, with weights for the species given by w^, typically 
[and as proposed by Pavoine, Dufour and Chessel (2004)] the relative 
abundance of the species in all the samples. Specifically, let U be the 
eigenvectors of 

Then the new coordinates of the species are given by the rows of Z G 
j^Sxs ^g* ^ g _ I jg ^]-^g dimension of the space required to contain the 

species). Then we have Z = D^JJ'^\JA^^'^ . Note that we could also start 
with a similarity matrix between species, Sv = Iv"^ + vl"^ — for any v 
that implies Sv is positive definite. Because 

PwSvP^=Pw(-5/2)P^ 

for any weights w and vector v the MDS will be equivalent. This is, 
of course, the standard equivalence between starting with a similarity 
matrix or dissimiliarity matrix in MDS. 
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2. Set the coordinates of the locations to be at the barycenter of the species 
coordinates. In other words, each location I is given coordinates that are 
the weighted average of the coordinates of all the species and the weights 
are given by the relative abundance of the species in that site (which is 
contained in the vector x^). Let the rows of the L x s* matrix Y contain 
the coordinates of the sites, so 

Y = XZ. 

The squared pairwise Euclidean distance between the locations using 
these coordinates will be equal to their Rao Dissimilarity using the dis- 
similarity matrix S. 

3. Find a lower-dimensional representation of the locations using a general- 
ized principal components analysis on the triplet (Y, I5, D^^), where 

is a diagonal matrix consisting of weights for the locations, wx, (again, 
typically the relative abundance of the locations in all the samples) . Let 
r = rank(Y). Then gPCA of (Y,l5,Dw£) gives the eigenvalue equations, 

Y^DwiYF = F*, YY^Dw^G = G*, 

(1) 

where F'^F = I^, G^Dw^G = I,. 

and Y = G^^'^^F'^ is the generalized SVD decomposition of Y. The final 
coordinates of the locations are given by 

L = YF. 

We also transform the coordinates of the species to get species coordinates 
(see Section 4.2), 

K = ZF. 



Lemma. The coordinates for the locations given by L in DPCoA using S 
are equivalent to the coordinates X = XSv A of the locations given by gPCA 
with the triplet (X, Sv,Dw^), where X = XP^g is the column centered ma- 
trix of data. Furthermore, the coordinates of the species given by DPCoA 
in the matrix K are equivalent to the coordinates obtained by centering and 
then rotating the original axes by the transformation implied from the 
gPCA o/(X,Sv,Dwi) so i/iat K = PwgSv A. 

Proof. The fundamental eigenequations for a gPCA of the triplet (X, 
Sv,Dw^) are 

X^DwiXSvA = A*, XSvX^DwiB = B*, 

(2) 

where A^SvA = I,., B^Dw^B = I^, 
so that XPwg = B*^/^A^ is the corresponding gSVD. 
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Siiic6 X! — X^P^^ , we see that B and G from DPCoA are both eigenvec- 



tors for the same matrix, XP^gSvP^cj^^Dw^ , implying that B and G are 



the Dw^-orthonormal eigenvectors of the same matrix. This imphes that the 
eigenvalues are the same = and that B and G are the same up to 
a sign change (assuming unique eigenvalues). 

The resulting coordinates for^the locations under DPCoA are given by L = 
YF = G*^/2. With gPCA of (X, Sv, Dw^), the location coordinates are X = 
XPwsSvA = B*^/2 ^^^^ therefore, we have that L = X — the coordinates 
of the locations are the same in the two methods. 

The coordinates for the species are given by DPCoA as the rotation of the 
coordinates given in Z by F: K = ZF. By the gSVD decomposition of Y, we 
can write F^ = ^^^^G^D^^Y and, similarly, B*-^/^ ^ xPwgSvA^-^ 
Remembering that ZZ"^ = Pws^Pwg) the final coordinates of the species 
from DPCoA are given by 

K = ZY^Dw^G^-^/^ = ZZ^X^DwiG*-^/^ 

= Pws^PwsX^Dw^G*-'/' 

= PwgSvP^gX Dw^XPviTgSv A = P^t, Sv A 

V ' 

=A* from (2) 

up to the sign change difference between G and B. □ 



APPENDIX B: KERNEL ANALYSIS AND GPCA 

Multivariate kernel methods seek a set of functions fi, ■ ■ ■ ,fk from our 
general data space X into M, such that the possible set of functions / form 
a Reproducing Kernel Hilbert Space with respect to a kernel function /C on Af 
[see Scholkopf and Smola (2002) for details]. The solutions for a multivari- 
ate Kernel CCA (or extensions) can be recovered from the eigenequations, 
assuming Kj are invertible 

K-/K2KiUi = K5,UiA 

with the constraint that 

ufK5,u, = r, 

and U2 is given by 

U2 = K-/KiUi(Arir^i)-i/^ 

where K^. = (1 — ^j)Kj/n + ^jl, and Fj are diagonals of normalization con- 
stants chosen by the user. Then the new coordinates of an object x from 
data set i are given by (/i(x), . . . ,fkix))^ = k-^Uj, where k(j) = ICi{x,Xj). 
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Let Ki = Qn, K2 = XQpX"^, and ^1=^2 = 1) then we have the eigenequa- 
tion 

(3) XQpX^Q^Ui = UiA. 

We see that these are equivalent to the gPCA equations, with Ui = B. Then 
the coordinates associated with the data matrix X from the kernel method 
are 

K2U2 = K2KiUi(Arir2 = xQpX^Q„UiA"V2(r,r2 

while those from the gPCA are 

XQpA = XQpX^Q„BA-i/2 

Choosing the scaling of the eigenvectors so that (FiFg ^) = I makes the 
solutions equivalent. 

APPENDIX C: INERTIA AND DISSIMIL ARIES 

We generalize the results of Pelissier et al. (2003) to show the derivation 
of the dissimilarity and diversity results above. 

In gPCA, the term inertia is used for the inter-point similarities, and the 
total inertia between points is defined as /(X, Qp, Q„) = tr(QriXQpX^) = 
^Aj. Then if X(j,) are the new coordinates of X restricted to the first r 

dimensions and X(_^) the remaining p — r dimensions, we can decompose 
the total inertia into the inertia of the first r dimensions and that of the 
remaining p — r, 

/(X, Qp, Q„) = /(X(j,-) , Ip, Q.„) + /(X(-_r) , Ip, Q„) 

r p 
i=l i=r+l 

and the first r dimensions give maximal possible inertia for r dimensions. 

Let Y G M^^'^ be the incidence matrix for the species variable, where ^(is) 
is an indicator of the ith observation being species s. Let Z E M^^^ be 
a similar such incidence matrix for the location variable. Then the iner- 
tia of the eigenanalysis of the triplet (Y,Q,D7v), where Y is the (non- 
weighted) centered Y and D^v is a diagonal matrix of elements, will be 
equal to i?Q(x). Regressing Y onto Z gives predictions Yz and residuals 
Y|^ = Y — Y^. Then the total inertia (It) can be broken into the inertia 
due to differences between locations (/b) plus the remaining inertia within 
locations (/w)) 

Inertia(Y, Q,DAr) = Inertia(Y2, Q, Dat) -|- Inertia(Y|^, Q, Dtv). 
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Note that = ZX, so that the inertia explained by Z is equal to the 
inertia of the eigenanalysis of X, 

Ib = Inertia(Y^, Q, Dtv) = Inertia(X, Q, Dw^)- 

This implies that the ordination procedures described above best preserve 
the between location dissimilarities defined by the metric Q. 

APPENDIX D: THE LAPLACIAN AND A LAPLACIAN FOR TREES 

The Laplacian matrix that is associated with the graph is given by L = 
D — A, where A is the adjacency matrix of the graph and D is the diagonal 
matrix consisting of the degree of each vertex. The spectral decomposition 
of L is closely related to certain properties of the graph; in particular, there 
are many results linking the eigenvalues of L with fundamental character- 
istics of the graph [see Diestel (2005)]. There are fewer explicit character- 
izations of the eigenvectors that hold for all graphs. In a general way, the 
eigenvectors corresponding to small eigenvalues of L represent large divi- 
sions in the graph (indeed, for Aq = 0, we have the eigenvector 1 which is 
an average of all the nodes); they tend to be zero for large portions of the 
graph and the nonzero components are the same sign distinct regions of the 
graph. Those eigenvectors corresponding to large eigenvalues tend be domi- 
nated by linear combinations of "close" nodes or smaller groups of nodes and 
represent the "noisy," small differences within neighboring vertices. Thus, 
the eigenvectors of the Laplacian have "multiscale" characteristics, partic- 
ularly those eigenvectors corresponding to the largest and smallest of the 
eigenvalues. For data x associated with a graph, with each element of x 
corresponding to a node in the graph, the metrics for a graph based on the 
Laplacian will usually put greater weight on the eigenvectors corresponding 
to small eigenvalues, for example, 1/Aj or exp(— l/Aj). This choice corre- 
sponds to the behavior of the eigenvectors. 

The Laplacian gives the covariance between nodes from a useful model 
for describing relationships among the nodes — a model of diffusion of in- 
formation through the graph. The covariance from this model is given by 
exp(— 2aL), known as the heat kernel of the graph [see Kondor and Lafferty 
(2002) for review]. Of course, this is equivalent to weighting the eigenvectors 
of the Laplacian with weight function exp(— aAj). 

A phylogenetic tree is, of course, a graph, and the Laplacian of a tree and 
the distances between nodes on a tree are quite simply related [Bapat, Kirk- 
land and Neumann (2005)]. Let St be the distance matrix of the patristic 
distances between all the nodes of the tree (internal nodes as well as the 
leaves), and let L be the Laplacian of the tree with weights l/d{r, s) on each 
edge. Then we have that 
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where for a phylogenetic tree v is —1 or 1 depending on whether the node 
is a leaf of the tree or not. 

However, since our data is observed on only certain nodes of the graph — 
the leaves of the tree — we need a metric that gives a relationship only be- 
tween the leaves. If we use the Laplacian as our phylogenetic metric, we 
would have to constrain ourselves to the portion of the metric that corre- 
sponds to the relationships between just the leaves, Lg. If we took as our 
metric the inverse of the Laplacian — which corresponds to an appropriate 
ordering of the eigenvectors by weighting each by 1/Aj — we have that L^^ 
is given by 

L^^ = C77"^ — l/2<55, where c = (81"^d5x/l)~^, 7 = ^tv, 

and Ss G M'^^'^ is the distance matrix restricted to the distances between 
leaves of the tree and Ssi S M"^^"^"^ is the distance matrix restricted to the 
distances between the leaves of the tree and S — 1 internal nodes of the tree. 
This is an expression somewhat similar to our similarity matrix for DPCoA, 
but note that a gPCA based on L^^ is not equivalent to DPCoA because 
p^^pT (jQgg jjQ^ vanish. 

However, restricting the metric to those portions dealing only with the 
leaves makes the metric difficult to interpret. The Laplacian restricted to 
the leaves will no longer have the same eigenvectors as the Laplacian and 
thus loses its connection to the behavior shown by the eigenvectors of the 
Laplacian. Furthermore, from the point of view of covariance modeling, the 
phylogenetic tree represents an evolutionary story that is more directly mod- 
eled by S. 



APPENDIX E: EIGENVECTORS OF S FOR A PHYLOGENETIC 



Note the block structure in S: if the root ancestor, TZ, has immediate 
descendants Vi and V2, then the covariance between any of the existing 
descendants of Vi and those of V2 will be 0. Thus, we can order the rows 
and columns of 5] so that 



where Si is a x 5i matrix, 5i is the number of extant species descended 
from Vi , and similarly with S2 . This means that the eigenvectors of S must 
be of the form 



where {'Vu}^!^ are the eigenvectors of Si and {'V2j}j^i are the eigenvec- 
tors of S2. Therefore, every eigenvector of S, at a minimum, must be only 
nonzero for one of the lineages. 



TREE 



(4) 




(5) 




or 
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Indeed, if we think back to the definition of S, the elements of the 
blocks I]i,S2 are themselves rank-1 perturbations of block diagonal ma- 
trices: 



where ci = dq-ilZ^Vi) and C2 = dq-{JZ,V2) (here we have assumed that t oc 1 
for simplicity). This same logic continues so that each sub-block can be 
written as a block matrix plus a rank-one perturbation. S thus consists of 
such nested rank-1 perturbations of block matrices. 

The claims in the literature for a relationship of the eigenvectors of I] 
to the partitions of the tree all stem from the comments of Cavalli-Sforza 
and Piazza (1975). They make assertions which they prove only in the case 
of a tree with four leaves {S = 4) and under the assumption of a constant 
rate of evolution (t oc 1). One assertion is true: for any terminal bifurcation 
node (a node whose two descendants are existing species or leaves of the 
tree), there is an eigenvector of SI that has elements that are positive for 
one of the species, negative for the other and zero for all other species. In 
addition, we see that because of the block structure, every eigenvector of 5], 
at a minimum, must consist of zero elements for one branch of the tree. 

Beyond this, Cavalli-Sforza and Piazza (1975) describe "usual" behav- 
ior of the eigenvectors, but their ideas do not scale as the size of the tree 
increases. The nested block structure of S still has the effect of creating 
eigenvectors with some structure to them, though not as easily classified as 
suggested in Cavalli-Sforza and Piazza (1975). Generally the structure of 
the eigenvectors will not be directly related to a partition in the tree. In 
practice, the eigenvectors often have some relation to the bifurcations of the 
tree, particularly the deeper (earlier in time) bifurcations and of course the 
terminal bifurcations. The other eigenvectors often have clumps of positive 
and negative elements that correspond to subtrees of the tree, and we often 
empirically see as the eigenvalues get smaller some sort of concentration of 
large values in only a few species. 



The ellipse plots given in Figure 2(b) are provided by the ade4 pack- 
age and represent a location vector, X£, as an ellipse. For completeness, we 
explain here what ade4 is plotting. 

Let the species coordinates as transformed into two dimensions by the 
ordination of the locations be given in the columns of K2 G M'^^^. Then the 
ellipsoid for x^ G R"^ is defined by 



(6) 




APPENDIX F: ELLIPSES IN DPCOA PLOTS 



v^(Kl^Dx,K2)-V = 1, 



where the ellipse is centered at x; . 
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This curve consists of the points with norm 1 in the Mahalanobis metric, 
only the estimate of the variance in Mahalanobis distance is calculated with 
weights on the points (species) given by x;. Equivalently, the ellipses in Fig- 
ure 2(b) will have major and minor axes in the direction of the weighted prin- 
cipal components of the coordinates of the species in K2, with the lengths of 
the axes given by the weighted standard deviation of the species coordinates 
in those directions (an ellipse defined by the equation x^^Qx = 1 will have 
major and minor axes in the directions of the eigenvectors of Q with lengths 
given by l/i/Ai, where Aj is an eigenvalue of Q). 
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